Method for processing data

ABSTRACT

A method and device for translating a program to a system including at least one first processor and a reconfigurable unit. Code portions of the program which are suitable for the reconfigurable unit are determined. The remaining code of the program is extracted and/or separated for processing by the first processor.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 10/480,003, filed on Jun. 18, 2004, which is a national phase of International Application No. PCT/EP02/06865, filed on Jun. 20, 2002, which claims priority to German Patent Application No. DE 101 29 237.6, filed on Jun. 20, 2001, the entire contents of each of which are expressly incorporated herein by reference thereto.

FIELD OF THE INVENTION

The present invention relates to data processing. In particular, the present invention relates to traditional, i.e., conventional and reconfigurable processor architectures as well as methods therefor, which permit translation of a classical high-level language (PROGRAM) such as Pascal, C, C++, Java, etc., in particular onto a reconfigurable architecture. The present invention relates in particular to integration and/or close coupling of reconfigurable processors with standard processors, data exchange, and synchronization of data processing.

BACKGROUND INFORMATION

A conventional processor architecture (PROCESSOR) is understood in the present case to refer to sequential processors having a von Neumann architecture or a Harvard architecture, such as controllers or CISC processors, RISC processors, VLIW processors, DSP processors, etc.

The term “reconfigurable target architecture” is understood in the present case to refer to modules (VPUs) having a function and/or interconnection that is repeatedly configurable, in particular configurable without interruption during run time, in particular integrated modules having a plurality of one-dimensionally or multidimensionally arranged arithmetic and/or logic and/or analog and/or memory modules, in particular also coarse-grained modules (PAEs) which are interlinked directly or via a bus system.

The generic class of such modules includes in particular systolic arrays, neural networks, multiprocessor systems, processors having a plurality of arithmetic units and/or logic cells, interlinking and network modules such as crossbar switches as well as known modules of the generic types FPGA, DPGA and XPUTER, etc. In this connection, reference is made in particular to the following patents and patent applications: P 44 16 881.0-53, DE 197 81 412.3, DE 197 81 483.2, DE 196 54 846.2-53, DE 196 54 593.5-53, DE 197 04 044.6-53, DE 198 80 129.7, DE 198 61 088.2-53, DE 199 80 312.9, PCT/DE 00/01869, DE 100 36 627.9-33, DE 100 28 397.7, DE 101 10 530.4, DE 101 11 014.6, PCT/EP 00/10516, EP 01 102 674.7, DE 196 51 075.9-53, DE 196 54 846.2-53, DE 196 54 593.5-53, DE 197 04 728.9, DE 197 07 872.2, DE 101 39 170.6, DE 199 26 538.0, DE 101 42 904.5, DE 101 10 530.4. These are herewith incorporated to the full extent for disclosure purposes.

This system may be designed in particular as a (standard) processor or module and/or may be integrated into a semiconductor (system on chip, SoC).

Reconfigurable modules (VPUs) of different generic types (such as PACT XPP technology, Morphics, Morphosys, Chameleon) are largely incompatible with existing technical environments and programming methods.

Programs for these modules are typically incompatible with existing programs of CPUs. A considerable development expense is thus necessary for programming, e.g., in particular for modules of the generic types Morphics, Morphosys. Chameleon already integrates a standard processor (ARC) on more or less reconfigurable modules. This makes approaches for programming tools available. However, not all technical environments are suitable for the use of ARC processors; in particular there are often existing programs, code libraries, etc. for any indeterminate other CPUs.

In internal experiments it has been found that there are certain methods and program sequences which may be processed better using a reconfigurable architecture rather than a conventional processor architecture. Conversely, there are also such methods and program sequences which are better executed using a conventional processor architecture. It would be desirable to provide a sequence partitioning to permit appropriate optimization.

Conventional translation methods for reconfigurable architectures do not support any forwarding of codes to any standard compilers for generating object codes for any desired PROCESSOR. Ordinarily, the PROCESSOR is fixedly defined within the compiler.

In addition, there are no scheduling mechanisms for reconfiguring the individual configurations generated for VPUs. In particular there are no scheduling mechanisms for configuration of independently extracted portions or for individual partitions of extracted portions. Conventional corresponding translation methods are described in the dissertation Übersetzungsmethoden für strukturprogrammierbare Rechner [Translation Methods for Structure Programmable Computers], by Dr. Markus Weinhardt, 1997, for example.

Several conventional methods are known for partitioning array CODE, e.g., João M. P. Cardoso, Compilation of Java™ Algorithms onto Reconfigurable Computing Systems with Exploitation of Operation-Level Parallelism, Ph.D. dissertation, Universidade Técnica de Lisboa (UTL), 2000.

However, these methods are not embedded into any complete compiler systems. Furthermore, these methods presuppose complete control of the reconfiguration by a host processor, which involves considerable complexity. The partitioning strategies are designed for FPGA-based systems and therefore do not correspond to any actual processor model.

SUMMARY

An object of the present invention is to provide a method for a commercial application.

A reconfigurable processor (VPU) is thus designed into a technical environment which has a standard processor (CPU) such as a DSP, RISC, CISC processor or a (micro)controller. The design may be accomplished according to an embodiment of the present invention in such a way that there is a simple and efficient connection. One resulting aspect is the simple programmability of the resulting system. Further use of existing programs of the CPU as well as the code compatibility and simple integration of the VPU into existing programs are taken into account.

A VPU (or a plurality of VPUs, although this need not be mentioned specifically each time) is coupled to a preferred CPU (or a plurality of CPUs, although this need not be mentioned specifically each time) so that it assumes the position and function of a coprocessor (or a plurality of coprocessors that respond optionally). This function permits a simple tie-in into existing program codes according to the pre-existing methods for working with coprocessors according to the related art.

The data exchange between the CPU and VPU according to the present invention may be accomplished by memory coupling and/or IO coupling. The CPU and VPU may share all resources; in particular embodiments, it is also possible for the CPU and VPU to jointly use only a portion of the resources and to make other resources available explicitly and/or exclusively for a CPU or VPU.

To perform a data exchange, data records and/or configurations may be copied and/or written/read in memory areas particularly provided for those purposes and/or corresponding base addresses may be set in such a way that these point to the particular data areas.

To control the coprocessor, a data record containing the basic settings of a VPU, e.g., certain base addresses, is preferably provided. In addition, status variables may also be provided for triggering and for function control of a VPU by a CPU and for acknowledgments from a VPU to a CPU. This data record may be exchanged via a shared memory (RAM) and/or via a shared peripheral address space (IO).

For synchronization of the CPU and VPU, unilaterally or mutually acting interrupt methods (which are implemented, for example, by signal transfer over interrupt lines and/or interrupt inputs that are specifically dedicated and/or designed for this purpose) are used and/or the synchronization is accomplished by polling methods. Furthermore, interrupts may also be used for synchronization of data transfers and/or DMA transfers.
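The control data record and the polling variant of synchronization can be illustrated with a small simulation. This is only a sketch under stated assumptions: the VPU is stood in for by a Python thread, the shared memory by a list, and the names `ControlRecord`, `poll_until`, `vpu_worker`, `cpu_offload_sum` and the status codes are hypothetical, not taken from the specification.

```python
import threading
import time
from dataclasses import dataclass

# Hypothetical status codes held in the shared control record.
IDLE, START, DONE = 0, 1, 2


@dataclass
class ControlRecord:
    data_addr: int = 0    # base address of the input data in shared memory
    length: int = 0       # number of elements to process
    result_addr: int = 0  # base address where the result is written
    status: int = IDLE    # status variable used for triggering/acknowledgment


def poll_until(record, wanted, interval=0.001, timeout=2.0):
    """Poll the status variable until it holds the wanted value."""
    deadline = time.monotonic() + timeout
    while record.status != wanted:
        if time.monotonic() > deadline:
            raise TimeoutError("status never reached %d" % wanted)
        time.sleep(interval)


def vpu_worker(record, memory):
    """Simulated VPU: waits for START, works on shared memory, reports DONE."""
    poll_until(record, START)
    block = memory[record.data_addr:record.data_addr + record.length]
    memory[record.result_addr] = sum(block)
    record.status = DONE  # acknowledgment from the VPU to the CPU


def cpu_offload_sum(memory, data_addr, length, result_addr):
    """CPU side: fill in the record, trigger the VPU, poll for completion."""
    record = ControlRecord(data_addr, length, result_addr)
    vpu = threading.Thread(target=vpu_worker, args=(record, memory))
    vpu.start()
    record.status = START  # trigger
    poll_until(record, DONE)
    vpu.join()
    return memory[result_addr]
```

For instance, `cpu_offload_sum([1, 2, 3, 4, 0], 0, 4, 4)` lets the simulated VPU sum the first four memory cells and write the result into the fifth. An interrupt-based variant would replace the polling loop with a signal from the VPU side.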

In an example embodiment that is particularly preferred, a VPU is started by a CPU and thereafter operates preferably independently of the application.

A preferred design in which the VPU provides its own mechanisms for loading and controlling configurations is particularly efficient. VPUs of this generic type include, for example, PACT XPP and Chameleon. The circuits according to the present invention permit a method of operation in which the configurations of the VPU are loaded into a memory together with the program to be executed by the CPU. During execution of the program, the CPU may refer the VPU to the memory locations (e.g., by giving the addresses or pointers), each containing configurations to be executed. The VPU may then load the configurations independently and without further influence by the CPU. The execution by the CPU starts immediately or optionally by means of additional information (e.g., interrupt and/or start instruction).

In a particularly preferred expansion, the VPU may read and write data independently within a memory.

In a particularly preferred expansion, the VPU may also independently load new configurations out of the memory and may perform new configurations as needed without requiring any further influence by the CPU.

These embodiments permit extensive operation of VPUs independently of CPUs. Only a synchronization exchange between CPU and VPU, which may preferably take place bidirectionally, is provided in addition to coordinate data processing operations and/or executions of configurations.

It has also been recognized that methods of data processing may and/or should preferably be designed so that particularly suitable portions (VPU code) of the program to be translated are identified and extracted for the reconfigurable target architecture (VPU) to permit particularly efficient data processing. These portions are to be partitioned accordingly and the time sequence configuration of the individual partitions is to be controlled.

The remaining portions of the program may be translated onto a conventional processor architecture (PROCESSOR). This is preferably accomplished in such a way that these portions are output as high-level language code in a standard high-level language (e.g., ANSI C) so that an ordinary high-level language compiler (optionally pre-existing) is able to process it without difficulty.

It should also be pointed out that these methods may also be used for groups of a plurality of modules.

In particular a type of “double buffering” may be used for a particularly simple and at the same time rapid reconfiguration in which a plurality of VPUs are provided, so that a portion of the VPUs may be reconfigured at a time when another portion is computing and perhaps yet another may be inactive, for example. Data links, trigger links, status links, etc. are exchanged among a plurality of VPUs in a suitable way, and are optionally wired through addressed buses and/or multiplexers/demultiplexers according to the VPUs that are currently active and/or to be reconfigured.

One advantage of this method is that existing code which has been written for any processor may continue to be used by involving a VPU, and no modifications or only comparatively minor modifications need be made. The modifications may also be performed incrementally, with more code being transferred gradually from the processor to the VPU. The project risk drops, and there is a significant increase in clarity. It should be pointed out that such a successive transfer of more and more tasks to the VPU, i.e., to the integral, multidimensional, partially reconfigurable and in particular coarse-grained field of elements, has a special meaning on its own and is regarded as being inventive per se because of its major advantages in system porting.

In addition, the programmer is able to work in his/her accustomed development environment and need not become adjusted to a novel and possibly foreign development environment.

A first aspect of the present invention may be seen in the fact that a PROCESSOR is connected to one or more VPUs so that an efficient exchange of information is possible, in particular in the form of data information and status information.

Importance may also be attributed to the configuration of a conventional processor and a reconfigurable processor so that exchange of data information and/or status information between same is possible during running of one or more programs and/or without having to significantly interrupt data processing on the reconfigurable processor and/or the conventional processor in particular; importance may also be attributed to the design of such a system.

For example, one or all of the following linking methods and/or means may be used:

-   a) shared memory,
-   b) network (e.g., bus systems such as PCI bus, serial buses such as Ethernet, for example),
-   c) connection to an internal register set or a plurality of internal register sets,
-   d) other memory media (hard drive, flash ROM, etc.).

In principle, the VPU and/or the CPU may also independently access the memory without the assistance of a DMA. The shared memory may also be designed as a dual port memory or a multiport memory in particular. Additional modules may be assigned to the system, and in particular reconfigurable FPGAs may be used to permit fine-grained processing of individual signals or data bits and/or to make it possible to establish flexible adaptable interfaces (e.g., various serial interfaces (V24, USB, etc.), various parallel interfaces, hard drive interfaces, Ethernet, telecommunications interfaces (a/b, TO, ISDN, DSL, etc.)).

The structure of a VPU is known, for example, from the patents and patent applications described above. Attempts to arrive at alternative module definitions have become known under the name Chameleon, for example. VPUs may be integrated into a system in various ways. For example, a connection to a host processor is possible. Depending on the method, the host processor may assume the configuration control (HOSTRECONF) (e.g., Chameleon) or there may be, for example, a dedicated unit (CT) for controlling the (re)configuration.

Accordingly, the translator according to the method described here generates the control information for the reconfiguration for a CT and/or a HOSTRECONF.

The translation principle may be embodied in such a way that a preprocessor (PREPROCESSOR) extracts from a PROGRAM the portions that may be mapped efficiently and/or reasonably onto the particular VPU(s). These portions are transformed into a format suitable for VPUs (NML) and are then translated further into an object code.

The remaining code and/or the extracted code is expanded according to experience at or with respect to the location of the code portions that are missing due to the extraction, by adding an interface code which controls communication between PROCESSOR(s) and VPU(s) according to the architecture of the target system. The remaining code which has been optionally expanded may preferably be extracted. This may take place as follows, for example:

    ...
    Code
    ...
    // START_EXTRACTION
    Code to be extracted
    // END_EXTRACTION
    ...
    Code
    ...

“// START_EXTRACTION” denotes the start of a code to be extracted. “// END_EXTRACTION” denotes the end of a code to be extracted.

In such a case, the unit for implementation of the program in configuration codes is designed to recognize the hints and/or implementation instructions.
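The recognition of such hints can be sketched as a simple line-based preprocessor. The following is a minimal illustration only, assuming the markers appear alone on a line as “// START_EXTRACTION” and “// END_EXTRACTION”; the function name `split_program` and the `vpu_call_N();` placeholder that stands in for the interface code are hypothetical.

```python
START = "// START_EXTRACTION"
END = "// END_EXTRACTION"


def split_program(source, call_stub="vpu_call_{n}();"):
    """Split source text into (remaining_code, list_of_extracted_portions).

    Each extracted portion is replaced in the remaining code by a
    placeholder line that marks where interface code would be inserted.
    """
    remaining, extracted, current = [], [], None
    for line in source.splitlines():
        marker = line.strip()
        if marker == START:
            if current is not None:
                raise ValueError("nested START_EXTRACTION")
            current = []
        elif marker == END:
            if current is None:
                raise ValueError("END_EXTRACTION without START_EXTRACTION")
            extracted.append("\n".join(current))
            remaining.append(call_stub.format(n=len(extracted) - 1))
            current = None
        elif current is not None:
            current.append(line)    # inside a marked region: VPU code
        else:
            remaining.append(line)  # outside: stays with the PROCESSOR
    if current is not None:
        raise ValueError("unterminated START_EXTRACTION")
    return "\n".join(remaining), extracted
```

The remaining code returned by this sketch could then be handed to an ordinary compiler, while each extracted portion would go on to the VPU translation path.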

It is also possible for portions of the PROGRAM to be implemented directly in NML for extraction by calling NML routines and to jump to the NML routines using calls. This may take place as follows, for example:

a) NML code

    ...
    procedure EXAMPLE
    begin
    ...
    end
    ...

b) PROGRAM code

    ...
    Code
    ...
    call EXAMPLE      // call of the NML code
    ...
    Code
    ...

In this case, the unit for implementation is designed to tie NML program portions, i.e., program portions for execution in and/or on a reconfigurable array, into a larger program.

Alternatively and/or additionally, extraction from an object-oriented class is also possible.

Macros suitable for a VPU are defined as a class in the class hierarchy of an object-oriented programming language. The macros may be characterized by annotation so that they are recognized as codes intended for a VPU and are processed further accordingly, even in higher hierarchies of the language.

Within a macro, a certain networking and/or mapping is preferably predetermined by the macro which then determines the mapping of the macro onto the VPU.

Instantiation and chaining of the class results in implementation of the function which includes a plurality of macros on the VPU. In other words, instantiation and chaining of macros define the mapping and interconnection of the individual operations of all macros on the VPU and/or the interconnection and/or data exchange between the VPU and CPU, if necessary.

The interface codes are added in instantiation. Chaining describes the detailed mapping of the class on the VPU.

A class may also be formed as a call of one or more NML routines, for example.

a) Class code

    ...
    class EXAMPLE
    begin
    ...
    end
    ...

b) PROGRAM code

    ...
    Code
    ...
    EXAMPLE var( )       // instantiation of the class
    ...
    Code
    ...

Extraction by analysis is also possible. Portions within the PROGRAM which may be mapped efficiently and/or appropriately on the VPU are recognized using the analytical methods adapted to the particular VPU.

These portions are extracted from the PROGRAM.

An analytical method suitable for many VPUs, for example, is to create data flow graphs and/or control flow graphs from the PROGRAM. These graphs may then be analyzed automatically with regard to their possible partitioning and/or mapping onto the target VPU. In this case, the portions of the graphs generated and/or the corresponding PROGRAM PORTIONS, which may be partitioned and/or mapped sufficiently well, are extracted. To do so, a partitionability and/or mappability analysis may be performed, evaluating the particular property. Partitioning and extraction of the program portions on the VPU as well as the introduction of the interfaces provided are then performed according to this evaluation.

Reference is made here explicitly to the analytical methods described in German Patent Application DE 101 39 170.6 which may be used, for example. The aforementioned patent application is herewith incorporated to the full extent for disclosure purposes.

One possible analytical method is also provided by recognition of certain data types.

Different data types are more or less suitable for processing on a VPU. For example, complex pointer arithmetic, i.e., pointer-based data addressing (pointer), is difficult to map onto a VPU, whereas arrays are very easily mappable.

Therefore, the particular suitable data types and at least essential portions of their data processing may be transferred largely automatically or manually to a VPU according to the present invention and extracted accordingly. The extraction is performed in response to the occurrence of certain data types and/or data operations.

It should be pointed out here that additional parameters assigned to the data types may provide additional information for determining the executability and/or execution performance on a VPU and therefore may also be used to a significant extent for extraction. For example, the size of the arrays to be computed plays a significant role. It is usually not worthwhile to perform computations for small arrays on a VPU because the resources needed for synchronization and data exchange between the CPU and VPU may be excessive. However, it should again be pointed out that small arrays for which computations are performed particularly frequently within a loop are nevertheless very suitable for VPUs if the loop is computed almost completely on the VPU. Large arrays, however, may usually be computed particularly efficiently on a VPU.
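The trade-off between array size and synchronization overhead can be made concrete with a toy cost model. This is a sketch only: the function name `worth_offloading` and all cost figures are invented for illustration and would have to be replaced by values measured or estimated for an actual CPU/VPU system.

```python
def worth_offloading(n_elements, iterations_on_vpu=1,
                     cpu_cost_per_elem=4.0, vpu_cost_per_elem=1.0,
                     sync_overhead=200.0):
    """Return True if the estimated VPU time beats the estimated CPU time.

    All costs are in arbitrary, assumed time units. sync_overhead models
    one CPU/VPU synchronization plus data exchange; iterations_on_vpu
    models a loop that runs almost completely on the VPU, so that the
    overhead is paid only once for all iterations.
    """
    work = n_elements * iterations_on_vpu
    cpu_time = work * cpu_cost_per_elem
    vpu_time = work * vpu_cost_per_elem + sync_overhead
    return vpu_time < cpu_time
```

With these assumed figures, a single pass over a 16-element array is not worth offloading, a 4096-element array is, and the 16-element array becomes worthwhile once it is processed 1000 times inside a loop that stays on the VPU.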

In addition, it should be pointed out that certain data types may be created by a specially adapted compiler or, optionally, by a user (e.g., by using TYPE in Pascal), these being particularly suitable for VPUs and data processing of which is then executed on a VPU.

For example, there may be the following data types:

    TYPE stream1 of byte [ ];
    TYPE stream2 of byte [0..255];

The term “stream” defines a data stream usually of a great, possibly not previously known, and/or infinite, length. stream1 here has a length that is not known in advance. For example, an FIR filter programmed with this type of data (or, for example, an FFT or DCT) could be mapped automatically onto a VPU, and optionally rolled out. The reconfiguration is then typically and preferably performed in response to other mechanisms than the data stream, e.g., by counters, comparators, CT-controlled and/or by timeout. For example, if wave reconfiguration or some other reconfiguration is to be triggered here, a data packet, in particular a data byte, may be characterized via conventional methods as the last one, so that the reconfiguration is triggered after and/or with the run-through of this data packet.

stream2 defines a data stream having the length of 256 bytes here, which may be treated like stream1, but has the property of ending after 256 bytes and thus possibly triggering a reconfiguration after the end in the sense of the patents cited above by the same applicant. In particular a wave reconfiguration, e.g., according to DE 197 04 728.9, DE 199 26 538.0, DE 102 06 857.7, DE 100 28 397.7 may be triggered with the occurrence of the last data byte and the particular PAE processing the byte may be reconfigured with the processing of this last data byte.

A translation of the extracted code according to NML which is suitable for the implemented VPU may preferably be performed.

For data flow-oriented VPUs, a data flow graph and/or a control flow graph may be created automatically, for example. The graphs are then translated into NML code.

Corresponding code portions such as loops may then be translated via a database (lookup) or ordinary transformations may be performed. For code portions, macros may also be provided and are then used further according to the IKR disclosed in the aforementioned patent applications.

Modularization according to PACT13 (PCT/DE00/01869), FIG. 28 may also be supported.

Optionally, the mapping and/or its preparation may already take place on the VPU, e.g., by performing the placement of the required resources and routing the connections (place and route). This may be done, for example, according to the conventional rules of placement and routing.

It is also possible to analyze the extracted code and/or the translated NML code for its processing efficiency by using an automatic analytical method. The analytical method is preferably selected so that the interface code and the performance influences derived from it are also included in the analysis at a suitable point. Suitable analytical methods are described, for example, in the patent applications by the present patent applicant as cited above.

The analysis is optionally performed via complete translation and implementation on the hardware system by executing the PROGRAM and performing measurements using suitable conventional methods.

It is also possible that, based on the analyses performed, various portions that have been selected for a VPU by extraction might be identified as unsuitable. Conversely, the analysis may reveal that certain portions that have been extracted for a PROCESSOR would be suitable for execution on a VPU.

An optional loop permits optimization of the translation results: after the analysis, and based on suitable decision criteria, the method returns to the extraction step and repeats it with extraction specifications derived from the analysis. This is thus an iteration. This procedure is preferred.

A loop may be introduced into the compiler run at various points.

The resulting NML code is to be partitioned according to the properties of the VPU used as needed, i.e., broken down into individual portions which may be mapped into the particular resources available.

A plurality of such mechanisms, in particular those based on graph analysis, are known per se according to the related art. However, a preferred variant is based on analysis of the program sources and is known by the term temporal partitioning. This method is described in the aforementioned Ph.D. thesis by Cardoso, which is herewith incorporated to the full extent for disclosure purposes.

Partitioning methods, regardless of the type, are to be adapted according to the type of VPU used. When using VPUs which allow storage of intermediate results in registers and/or memories, the tie-in of the memories for storage of data and/or states is to be taken into account through the partitioning. The partitioning algorithms (e.g., the temporal partitioning) are to be adapted accordingly. Usually the actual partitioning and scheduling are greatly simplified and made possible in a reasonable manner for the first time through these patents.

Many VPUs offer the possibility of differential reconfiguration. This may be used when only relatively few changes within the configuration of PAEs are necessary in a reconfiguration. In other words, only the changes in a configuration in comparison with the present configuration are reconfigured. The partitioning in this case may be done so that the possibly differential configuration following a configuration contains only the required configuration data and does not constitute a complete configuration. It is possible to also take into account the configuration data overhead for analytical purposes in evaluating the partitioning efficiency.
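The idea of differential reconfiguration can be sketched by modeling a configuration as a mapping from PAE identifiers to settings. This is an illustrative assumption, not the configuration format of any actual VPU; the names `differential_config` and `apply_config` are hypothetical, and the sketch assumes that the target configuration lists every PAE.

```python
def differential_config(current, target):
    """Return only those PAE settings of target that differ from current."""
    return {pae: cfg for pae, cfg in target.items() if current.get(pae) != cfg}


def apply_config(current, delta):
    """Apply a (possibly differential) configuration on top of current."""
    updated = dict(current)
    updated.update(delta)
    return updated
```

The ratio `len(delta) / len(target)` then gives a rough figure for the configuration data saved, which could feed into an evaluation of partitioning efficiency.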

The scheduling mechanisms for the partitioned codes may be expanded so that scheduling is controlled by acknowledgment messages of the VPU to the particular unit performing the reconfiguration (CT and/or HOSTRECONF). In particular, the resulting possibility of a conditional execution, i.e., explicit determination of the subsequent partition by the state of the instantaneous partition, is utilized in partitioning. In other words, it is possible to optimize the partitioning so that conditional executions such as IF, CASE, etc. are taken into account.
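Conditional execution, in which the state reported by the current partition determines the next partition, can be sketched as a small table-driven scheduler. This is a software simulation under stated assumptions: partitions are modeled as Python functions that return a status string, and `run_schedule` and the transition table are hypothetical names rather than part of any CT or HOSTRECONF interface.

```python
def run_schedule(partitions, transitions, start, state):
    """Run partitions in an order decided by their reported status.

    partitions:  name -> callable(state) returning a status string
    transitions: (name, status) -> name of the next partition;
                 a missing entry ends the schedule
    Returns the list of partition names in execution order.
    """
    trace, current = [], start
    while current is not None:
        trace.append(current)
        status = partitions[current](state)  # acknowledgment of the partition
        current = transitions.get((current, status))
    return trace


# Example: an IF construct split into three partitions.
def compare(state):
    return "true" if state["x"] > 0 else "false"


def then_branch(state):
    state["y"] = 1
    return "done"


def else_branch(state):
    state["y"] = -1
    return "done"


PARTITIONS = {"cmp": compare, "then": then_branch, "else": else_branch}
TRANSITIONS = {("cmp", "true"): "then", ("cmp", "false"): "else"}
```

Calling `run_schedule(PARTITIONS, TRANSITIONS, "cmp", {"x": 5})` executes the comparison partition and then only the branch selected by its status.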

If VPUs which have the ability to transmit status signals between PAEs are used, the PAEs responding to the particular states transmitted and/or cooperating in their processing, then within the partitioning and the scheduling, the conditional execution may additionally be taken into account within the configuration of PAEs, i.e., without the necessity of complete or partial reconfiguration due to an altered conditional program run.

In addition, scheduling may support the possibility of preloading configurations during the run time of another configuration. A plurality of configurations may also be preloaded speculatively, i.e., without being certain that the configurations are needed at all. Through selection mechanisms, the configurations that are used may then be selected at run time (see also the example NLS in DE 100 50 442.6, EP 01 102 674.7).

According to an additional or alternative variant, data processing within the VPU connected to the CPU requires exactly the same number of cycles as data processing within the computation pipeline of the CPU. In the case of today's high-performance CPUs having a plurality of pipeline stages (>20) in particular, this concept may be used ideally. The special advantage is that no separate synchronization measures such as RDY/ACK are necessary and/or no adaptation of opcodes to the register control is necessary. In this method, the compiler must ensure that the VPU maintains the required number of cycles and that data processing may be balanced by the insertion of delay stages such as a fall-through FIFO, such as that described in other patent applications cited above.
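The balancing step the compiler has to perform can be sketched as a simple calculation: for each data path, the number of delay stages is the difference between the CPU pipeline depth and the latency of that path. The function name `delay_stages_needed`, the path names, and the latency figures are hypothetical.

```python
def delay_stages_needed(path_latencies, pipeline_depth):
    """Return, per VPU data path, the number of delay stages to insert
    so that every path takes exactly pipeline_depth cycles.

    path_latencies: path name -> latency in cycles (assumed known)
    Raises ValueError if a path is already longer than the pipeline.
    """
    too_slow = sorted(p for p, lat in path_latencies.items()
                      if lat > pipeline_depth)
    if too_slow:
        raise ValueError("paths exceed pipeline depth: %s" % too_slow)
    return {p: pipeline_depth - lat for p, lat in path_latencies.items()}
```

A path that exceeds the pipeline depth cannot be balanced by delay stages alone, which is why the sketch reports it as an error instead of returning a negative count.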

The code that is output is usually completely processable on the particular downstream compilers, preferably without any additional measures. If necessary, compiler flags and constraints may be generated for controlling downstream compilers, in which case the user may optionally add his or her own specifications and/or may modify the specifications generated. The downstream compilers do not require any significant modifications, so that standard conventional tools may in principle be used.

The method proposed here is thus suitable in particular as a preprocessor and/or as a processor method, for example, upstream from compilers and development systems. However, it should be pointed out explicitly that instead of and/or together with the translator described previously, compilers according to PACT11 (DE 101 39 170.6; US 2003/0056202) may also be involved in principle.

An FPGA may be connected to the architecture described here, in particular directly to the VPU, to permit fine-grained data processing and/or to permit a flexibly adaptable interface (e.g., various serial interfaces (V24, USB, etc.), various parallel interfaces, hard drive interfaces, Ethernet, telecommunications interfaces (a/b, TO, ISDN, DSL, etc.)) to additional modules. The FPGA may be configured from the VPU architecture, in particular by the CT and/or by the CPU. The FPGA may be operated statically, i.e., without run time reconfiguration, and/or dynamically, i.e., with run time reconfiguration.

Providing an interface code has already been mentioned. The interface code which is inserted into the extracted code may be predefined by various methods. The interface code is preferably stored in a database which is accessed. The unit for implementation may be designed to take into account a selection, e.g., by the programmer, in which the appropriate interface code is selected, e.g., based on instructions in the PROGRAM or by compiler flags. An interface code suitable for the implementation method of the VPU/CPU system, used in each case, may be selected.

The database itself may be created and maintained by various methods. A few examples will be presented here to illustrate the possibilities:

-   a) The interface code may be predefined by the supplier of the compiler for certain connection methods between the VPU and CPU(s). This may be taken into account in the organization of the database by keeping an appropriate memory device ready and available for this information.
-   b) The interface code may be written by the user himself, who determined the system structure, or it may be modified from existing (exemplary) interface code and added to the database. The database is preferably designed to be user-modifiable in this regard to allow the user to modify the database.
-   c) The interface code may be generated automatically by a development system using which the system structure of the VPU-CPU system has been planned and/or described and/or tested, for example.
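An interface code database with supplier-predefined entries and user modifications could be organized roughly as follows. This is a minimal sketch: the class `InterfaceCodeDatabase`, the coupling key `shared_memory`, and the C-like template text (`vpu_start`, `vpu_wait`, `my_start`) are all hypothetical and only illustrate the lookup order, in which user entries take precedence over supplier entries.

```python
class InterfaceCodeDatabase:
    """Interface code templates keyed by the CPU/VPU connection method."""

    def __init__(self, supplier_entries):
        self._supplier = dict(supplier_entries)  # predefined by the supplier
        self._user = {}                          # added or modified by the user

    def add_user_entry(self, coupling, template):
        self._user[coupling] = template

    def lookup(self, coupling):
        # User entries override supplier entries; unknown keys raise KeyError.
        if coupling in self._user:
            return self._user[coupling]
        return self._supplier[coupling]

    def emit(self, coupling, **params):
        """Fill the selected template with system-specific parameters."""
        return self.lookup(coupling).format(**params)
```

The coupling key passed to `emit` could come from an instruction in the PROGRAM or from a compiler flag, in line with the selection mechanisms described above.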

The interface code is usually preferably designed in such a way that it conforms to the requirements of the programming language in which the extracted code was written and into which the interface code is to be inserted.

Debugging and Integration of the Tool Sets

Communication routines may be introduced into the interface codes to synchronize various development systems for the PROCESSOR and the VPU. In particular, code for the particular debugger (e.g., according to PACT11) may also be included.

The interface code is designed to control and/or enable data exchange between the PROCESSOR and the VPU. It is therefore a suitable and preferred interface for controlling the particular development systems and debuggers. For example, it is possible to activate a debugger for the PROCESSOR as long as the data is being processed by the processor. As soon as the data is transferred via the interface code to one or more VPUs, a debugger for the VPUs is to be activated. If the code is sent back to the PROCESSOR, the PROCESSOR debugger is again to be activated. It is therefore also possible and preferable to handle such sequences by inserting control codes for debuggers and/or development systems into the interface code.

Communication and control between the different development systemsshould therefore preferably be handled via control codes introduced intothe interface codes of the PROCESSOR and/or VPU. The control codes maylargely correspond to existing standards for the control of developmentsystems.

Administration and communication of the development systems arepreferably handled as described in the interface codes, but they mayalso be handled separately from them (if appropriate) according to acorresponding similar method.

In many programming languages, in particular in sequential languages such as C, a precise chronological order is predetermined implicitly by the language. In the case of sequential programming languages, this is accomplished by the sequence of individual instructions, for example. If required by the programming language and/or the algorithm, the time information may be mapped onto synchronization models such as RDY/ACK and/or REQ/ACK or to a time stamp method.

For example, the following FOR loop may be run and iterated only when a variable (inputstream here) is acknowledged with a RDY in each run. If there is no RDY, the loop run is stopped until RDY is received:

while TRUE
  s := 0
  for i := 1 to 3
    s := s + inputstream;

The property of sequential languages of being controlled only by instruction processing is connected to the data flow principle of controlling processing through the data flow, i.e., the existence of data. In other words, an instruction and/or a statement (e.g., s := s + inputstream;) is processed only when it is possible to execute the operation and the data is available.
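The RDY-gated loop may be illustrated by a minimal Python sketch. It is an illustration only, not the implementation described here: a `queue.Queue` stands in for inputstream, the blocking `get()` call plays the role of waiting for RDY, and the function name `sum_triples` and the finite `rounds` parameter (replacing "while TRUE") are assumptions made for the example.

```python
import queue

def sum_triples(inputstream, rounds):
    """Sum the stream three values at a time; every read waits for data."""
    results = []
    for _ in range(rounds):            # finite stand-in for "while TRUE"
        s = 0
        for _i in range(1, 4):         # for i := 1 to 3
            # Queue.get() blocks until an item is present, which models
            # the loop run being stopped until RDY is received.
            s = s + inputstream.get()
        results.append(s)
    return results

stream = queue.Queue()
for value in (1, 2, 3, 10, 20, 30):
    stream.put(value)                  # each put models data arriving with RDY
print(sum_triples(stream, rounds=2))   # prints [6, 60]
```

The sequential source text is unchanged; only the availability of data determines when each statement executes.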

It is noteworthy that this method does not usually result in any change in the syntax or semantics of a high-level language.

More complex functions of a high-level language such as looping are implemented by macros. The macros are predefined by the compiler and are instantiated at the translation time.

Macros are constructed either of simple language constructs of the high-level language or they are constructed at the assembler level. Macros may be parameterized to permit simple adaptation to the algorithm described (see also PACT11).

A standard processor, e.g., an RISC, CISC or DSP (CPU), is thus linked to a reconfigurable processor (VPU).

Two different linkage variants, which preferably may also be implemented simultaneously, may be described as follows.

A first variant includes a direct link to the instruction set of a CPU (instruction set linkage).

A second variant involves linkage via tables in the main memory. Tabulation means are therefore provided in this variant.

Free unused instructions are usually present within an instruction set (ISA) of a CPU. One or more of these free unused instructions is now used to control VPUs (VPUCODE).

A configuration unit (CT) of a VPU is triggered by the decoding of a VPUCODE, and executes certain sequences as a function of the VPUCODE. There is thus a CT that responds to VPUCODE decoding.

A VPUCODE may, for example, trigger the loading and/or execution of configurations by the configuration unit (CT) for a VPU.

In an expanded embodiment, a VPUCODE may be translated to different VPU instructions via a translation table which is preferably managed by the CPU, or alternatively by a VPU or by an external unit.

The configuration table may be set as a function of the CPU program or code section that has been executed.

After arrival of a load instruction, the VPU loads configurations out of its own memory or a memory shared with the CPU. In particular, a VPU configuration may be included in the code of the CPU program being executed at the moment.

After receiving an execution instruction, a VPU executes the configuration to be executed and performs the corresponding data processing. The end of data processing may be indicated to the CPU by a termination signal (TERM). Appropriate signal lines/interrupt inputs, etc. are present and/or configured accordingly.

Due to the occurrence of a VPUCODE, wait cycles may be executed on the CPU until the termination signal (TERM) indicating the end of data processing by the VPU arrives.

In a preferred embodiment, processing of the next code continues. If another VPUCODE occurs, then it is possible to wait for the preceding code to be terminated, or all the VPUCODEs that have been started are queued in a processing pipeline, or a task switch is performed, in particular as described below.

Termination of data processing is signaled by the arrival of the termination signal (TERM) in a status register. Termination signals arrive in the order of a possible processing pipeline.

Data processing on the CPU may be synchronized to the arrival of a termination signal by testing the status register.
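Synchronization by testing the status register may be sketched as follows. This is a software model under stated assumptions, not a description of real hardware: the class `StatusRegister`, its methods `signal_term` and `test_term`, and the helper `wait_for_term` are hypothetical names, and a FIFO is assumed so that TERM entries are seen in pipeline order.

```python
from collections import deque

class StatusRegister:
    """Hypothetical status register; TERM entries arrive in pipeline order."""
    def __init__(self):
        self._terms = deque()

    def signal_term(self, job_id):
        # Called by the (simulated) VPU when a data processing run ends.
        self._terms.append(job_id)

    def test_term(self):
        # Return the oldest finished job, or None if no TERM is pending.
        return self._terms.popleft() if self._terms else None

def wait_for_term(status, do_other_work):
    """Test the status register; between tests run a wait cycle or other task."""
    while True:
        job = status.test_term()
        if job is not None:
            return job
        do_other_work()

status = StatusRegister()
# The callback models the VPU finishing K1 while the CPU is waiting.
finished = wait_for_term(status, lambda: status.signal_term("K1"))
print(finished)  # prints K1
```

The `do_other_work` callback marks the place where either wait cycles or a task switch could be performed.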

In one possible embodiment, a task switch may be triggered if an application cannot be continued before the arrival of TERM, e.g., due to data dependencies.

It is preferable if loose links are established between processors and VPUs, in which VPUs function largely as independent coprocessors.

Such a linkage involves one or more shared data sources and data sinks, usually over shared bus systems and/or shared memories. Data is exchanged between a CPU and a VPU via DMAs and/or other memory access controllers. Data processing is preferably synchronized via an interrupt control or a status query mechanism (e.g., polling).

A tight linkage corresponds to the direct linkage of a VPU to the instruction set of a CPU, as described above.

In a direct arithmetic unit linkage, a high reconfiguration performance in particular is important. Therefore, wave reconfiguration is preferred. In addition, the configuration words are preferably preloaded so that when the instruction is executed, the configuration may be configured particularly rapidly (via wave reconfiguration, in the optimum case within one cycle). Instead of a partial array configuration, it would also be possible to provide a plurality of arrays, identical arrays in particular, both for high-performance applications and in particular for primarily low-performance applications. At least one of these arrays is then reconfigured for a new task, in particular in advance, and the system changes easily and completely to another array as needed, instead of performing a reconfiguration or partial reconfiguration of an integral multidimensional coarse-grained field which is partially reconfigurable in run time. Signals may be sent to the subarrays, e.g., via MUX/DEMUX stages, in particular I/O signals, data signals, status signals, and/or trigger signals.

For wave reconfiguration, the configurations that are presumably to be executed will preferably be recognized in advance by the compiler at compilation time and preloaded accordingly at run time.

At the time of instruction execution, the corresponding configuration is optionally selected and executed individually for each PAE and/or for a PAE subset. Such methods are also described in the publications identified above.

A preferred implementation may provide for different data transfers between a CPU and a VPU. Three particularly preferred methods that may be used individually or in combination are described below.

In the case of register linkage, the VPU may take data from a CPU register, process it and write it back to a CPU register.

Synchronization mechanisms are preferably used between the CPU and the VPU.

For example, the VPU may receive a RDY signal due to the data being written to the CPU register by the CPU and then the VPU may process the data thus written. Readout of data from a CPU register by the CPU may result in an ACK signal, which thus signals to the VPU data acceptance by the CPU. Use of the conventional RDY/ACK protocol in a different manifestation is advantageous in the present case precisely with coarse-grained cells of reconfigurable units.

CPUs do not typically make similar mechanisms available.

Two possible implementations are described in greater detail.

One approach that is easily implemented is to perform the data synchronization via a status register. For example, the VPU may indicate to the status register the successful readout of data from a register and the associated ACK signal and/or input of data into a register and the associated RDY signal. The CPU first tests the status register and performs wait loops or task switching, for example, until the RDY or ACK is received, depending on the operation. The CPU will then continue to perform the particular register data transfer.

In an expanded embodiment, the instruction set of the CPU is expanded by adding load/store instructions with an integrated status query (load_rdy, store_ack). For example, store_ack writes a new data word into a CPU register only when the register has first been read out by the VPU and an ACK signal has been received. Accordingly, load_rdy reads data out of a CPU register only when the VPU has previously entered new data and generated a RDY signal.
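The behavior of load_rdy and store_ack may be modeled by the following sketch. It is a simplified software model: the class `HandshakeRegister` and its single full/empty flag are assumptions introduced for illustration, and the two functions return a success indicator instead of stalling, so that the caller may retry or switch tasks.

```python
class HandshakeRegister:
    """Hypothetical register guarded by one flag: full means RDY, empty means ACK."""
    def __init__(self):
        self.value = None
        self.full = False

    def write(self, value):
        # Succeeds only if the previous word has been read out (ACK).
        if self.full:
            return False
        self.value, self.full = value, True    # models raising RDY
        return True

    def read(self):
        # Succeeds only if new data has been entered (RDY).
        if not self.full:
            return False, None
        self.full = False                       # models raising ACK
        return True, self.value

def store_ack(reg, value):
    """CPU store with integrated status query (CPU-to-VPU register)."""
    return reg.write(value)

def load_rdy(reg):
    """CPU load with integrated status query (VPU-to-CPU register)."""
    return reg.read()

to_vpu = HandshakeRegister()
print(store_ack(to_vpu, 5))   # True: register was empty
print(store_ack(to_vpu, 6))   # False: VPU has not yet read the first word
to_vpu.read()                 # simulated VPU reads the word, producing ACK
print(store_ack(to_vpu, 6))   # True
```

Block move instructions could apply the same check to each word in turn.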

Data belonging to a configuration to be executed may be written to the CPU registers and/or may be read out of the registers successively more or less by block moves as in the related art. Block move instructions that are implemented if necessary may preferably be expanded by the integrated RDY/ACK status query described here.

A plurality of modifications and different embodiments of this basic method are possible.

The wave reconfiguration mentioned above allows starting of a new VPU instruction and the corresponding configuration as soon as the operand of the previous VPU instruction has been accepted from the CPU registers. The operands for the new instruction may be written directly into the CPU register after the instruction start.

According to the wave reconfiguration method, the VPU is reconfigured successively for the new VPU instruction on completion of data processing of the previous VPU instruction, and the new operands are processed.

In addition, data may be exchanged between a VPU and a CPU through suitable bus accesses to shared resources.

If there is to be an exchange of data that has been processed by the CPU just prior to the exchange and therefore is presumably still in the cache of the CPU (which is preferably to be provided), or if the data is processed by the CPU immediately next and therefore is logically placed in the cache of the CPU, this data is preferably read by the VPU out of the cache of the CPU or it is written to the cache of the CPU. This may be determined largely in advance at the compilation time through suitable analyses of the application by the compiler and the binary code may be generated accordingly.

If there is to be an exchange of data that is presumably not in the cache of the CPU and/or is presumably not needed subsequently in the cache of the CPU, it is preferably read directly by the VPU from the external bus and the data source connected to it (e.g., memory, peripheral) and/or written to the external bus and the data sink associated with it (e.g., memory, peripheral). This may be ascertained by the compiler largely in advance at compilation time of the application through suitable analyses, and the binary code may be generated accordingly.

In a transfer over the bus bypassing the cache, a protocol between the cache and the bus is preferably implemented, ensuring correct contents of the cache. For example, the conventional MESI protocol may be used for this purpose.

The methods described here need not at first have any particular mechanism for operating system support. It is preferable to ensure that an operating system to be executed behaves according to the status of a VPU to be supported, which is possible and to which end in particular schedulers may be provided.

In the case of a tight arithmetic unit linkage, the status register of the CPU into which the linked VPU enters its data processing status (termination signal) is preferably queried. If further data processing is to be transmitted to the VPU and the VPU has not yet terminated the previous data processing, the system will wait and/or a task switch will preferably be performed.

For coprocessor coupling, mechanisms controlled via the operating system, in particular the scheduler, are preferably used.

After transfer of a function to a VPU, a simple scheduler may allow the current task to continue running on the CPU if it is able to run independently and simultaneously with data processing on a VPU. If or as soon as the task must wait for termination of data processing on the VPU, the task scheduler switches to another task.

Each newly activated task will check (if it uses the VPU) before use whether the VPU is available for data processing and/or whether it is still processing data at the present time. Either it must then wait for termination of data processing or preferably the task is switched.

A simple and nevertheless efficient method may be created by so-called descriptor tables which may be implemented as follows, for example.

Each task generates one or more tables (VPUCALL) having a suitable fixed data format in the memory area assigned to it for callup of the VPU. This table contains all the control information for a VPU such as the program/configuration to be executed and/or the pointer to the memory location(s) or data sources of the input data and/or the memory location(s) or data sinks of the result data and/or additional execution parameters, e.g., data array variables.

The memory area of the operating system contains a table or an interlinked list (LINKLIST) which points to all the VPUCALL tables in the order of their creation.

Data processing on the VPU then takes place in such a way that a task creates a VPUCALL and calls up the VPU via the operating system. The operating system creates an entry in the LINKLIST. The VPU processes the LINKLIST and executes the particular VPU call referenced. The termination of the particular data processing is indicated by a corresponding entry in the LINKLIST and/or VPUCALL table.

The VPU thus works largely independently of the CPU. The operating system and/or the particular task must only monitor the tables (LINKLIST and/or VPUCALL).
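The descriptor-table mechanism may be sketched in Python as follows. The field names of `VpuCall`, the class `OperatingSystem`, and the function `vpu_process` are hypothetical and serve only to illustrate the flow; a Python callable stands in for the configuration to be executed, and termination is recorded in the VPUCALL entry (one of the two possibilities mentioned above).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class VpuCall:
    """Hypothetical VPUCALL table in a fixed format."""
    configuration: Callable[[int], int]   # stands in for the configuration
    inputs: List[int]                     # stands in for the input data source
    results: List[int] = field(default_factory=list)  # result data sink
    done: bool = False                    # termination entry

class OperatingSystem:
    def __init__(self):
        self.linklist = deque()           # LINKLIST, in order of creation

    def call_vpu(self, call):
        self.linklist.append(call)        # the OS creates a LINKLIST entry

def vpu_process(linklist):
    """Simulated VPU: work through the LINKLIST independently of the CPU."""
    while linklist:
        call = linklist.popleft()
        call.results = [call.configuration(x) for x in call.inputs]
        call.done = True                  # termination indicated in VPUCALL

system = OperatingSystem()
task_call = VpuCall(configuration=lambda x: x * 2, inputs=[1, 2, 3])
system.call_vpu(task_call)
vpu_process(system.linklist)
print(task_call.done, task_call.results)  # prints True [2, 4, 6]
```

The task only has to monitor `done` in its own table, which corresponds to monitoring the VPUCALL entry.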

These two methods are particularly efficient in performance if the VPU used has an architecture which allows reconfiguration that is and/or may be superimposed on data processing.

It is thus possible to start a new data processing and possibly a reconfiguration associated with it, immediately after reading the last operands out of the data sources. In other words, it is no longer the termination of data processing, but instead reading the last operands is necessary for synchronization. This greatly increases the performance in data processing.

The possible use of an operating system has an additional influence on the handling of states.

Operating systems use task schedulers, for example, for managing multiple tasks to permit multitasking.

Task schedulers interrupt tasks at a certain point in time, start other tasks and, after the latter have been processed, resume processing of the interrupted task. Locally relevant states may remain unsaved if it is ensured that a configuration (which corresponds to processing of a task) will be terminated only after complete processing, i.e., when all data and states to be processed within this configuration cycle have been saved.

However, if the task scheduler interrupts configurations before they have been completely processed, local states and/or data must be stored. In addition, this is advantageous when the processing time of a configuration cannot be predicted. In conjunction with the known halting problem and the risk that a configuration will not be terminated at all (e.g., due to an error), this also seems appropriate to prevent a deadlock of the entire system.

In other words, taking into account task switching, relevant states may also be regarded as states which are necessary for task switching and correct restart of data processing.

Thus, in task switching the memory for results and, if necessary, also the memory for the operands must be saved and restored again at a later point in time, i.e., on returning to this task. This may be performed by a method comparable to the conventional PUSH/POP instructions and methods. In addition, the state of data processing, i.e., the pointer to the last operand processed completely, must be saved. Reference should be made here in particular to PACT18.

Depending on the optimization of task switching, there are two options, for example:

a) The interrupted configuration is reconfigured and only the operands are loaded. Data processing begins anew as if the processing of the configuration had not even been started. In other words, all data computations are executed from the beginning, and if necessary, computations are even performed in advance. This option is simple but not very efficient.

b) The interrupted configuration is reconfigured, the operands and results that have already been calculated being loaded into the particular memory. Data processing is continued with the operands that have not been completely computed. This method is much more efficient, but it presupposes that additional states which occur during processing of the configuration may become relevant, if necessary; for example, at least one pointer to the last operand completely computed must be saved, so that it is possible to begin again with their successors after reconfiguration.
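The difference between options a) and b) may be illustrated by a small sketch. The function `run_configuration`, its `budget` parameter (which models an interruption by the task scheduler), and the squaring operation are assumptions chosen for the example; the returned value is the index of the first operand not yet computed, i.e., one past the last operand completely computed.

```python
def run_configuration(operands, results, start=0, budget=None):
    """Process operands from index start; stop after budget items if given.

    Returns the pointer to the next operand that has not been computed.
    """
    if budget is None:
        end = len(operands)
    else:
        end = min(len(operands), start + budget)
    for i in range(start, end):
        results[i] = operands[i] * operands[i]   # placeholder computation
    return end

operands = [1, 2, 3, 4]

# First run is interrupted after two operands; the pointer is saved.
results_b = [None] * len(operands)
saved_pointer = run_configuration(operands, results_b, budget=2)

# Option a): reload only the operands and compute everything again.
results_a = [None] * len(operands)
run_configuration(operands, results_a)

# Option b): reload operands and partial results, continue at the pointer.
run_configuration(operands, results_b, start=saved_pointer)

print(results_a == results_b)  # prints True
```

Both options yield the same results; option b) avoids repeating the first two computations at the cost of saving the pointer and the partial results.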

A particularly preferred variant for managing relevant data is made available through the context switching described below. In task switching and/or in executing and switching configurations (see, for example, patent application PACT15 (PCT/EP02/02398), which is herewith fully included for disclosure purposes) it may be necessary to save data or states, which are not typically saved together with the working data in the memories for a following configuration because they merely mark an end value, for example.

Context switching according to the present invention is implemented by removing a first configuration while the data to be saved remains in the corresponding memories (REGs) (memories, registers, counters, etc.).

A second configuration is loaded, connecting the REG in a suitable manner and in a defined order to one or more global memories.

The configuration may use address generators, for example, to access the global memory (memories). The configuration may use address generators, for example, to access REGs designed as memories. According to the configured connection between the REGs, the contents of the REGs are written into the global memory in a defined order, with the particular addresses being specified by address generators. The address generator generates the addresses for the global memory (memories) so that the memory areas containing data (PUSH AREA) of the first configuration that has been removed may be assigned unambiguously.

In other words, different address spaces are preferably provided for different configurations. This configuration corresponds to a PUSH of conventional processors.

Other configurations then use the resources.

The first configuration should be restarted. Before that, a third configuration interconnecting the REGs of the first configuration in a defined order is started.

The configuration may use address generators, for example, to access the global memory (memories).

The configuration may use address generators, for example, to access REGs configured as memories.

An address generator generates addresses so that correct access to the PUSH AREA assigned to the first configuration is achieved. The generated addresses and the configured order of the REGs are such that the data of the REGs is output from the memories and into the REGs in the original order. The configuration corresponds to that of a POP of conventional processors.

The first configuration is restarted.

In summary, a context switch is performed so that by loading particular configurations which operate like PUSH/POP of conventional processor architectures, the data to be saved is exchanged with a global memory.

The function is to be illustrated in an example. A function adds up two rows of numbers, where the length of the rows is not known at translation time, but instead is known only at run time.

proc example
  while i<length do
    x[i] = a[i] + b[i]

This function is now interrupted during execution, e.g., by a task switch, or because the memory provided for x is full. At this point in time, a, b and x are in memories according to the present invention; i and optionally length must be saved, however.

To do so, the configuration “example” is terminated, with the register content being saved and a configuration push being started, reading i and length out of the registers and writing them into a memory.

proc push
  mem[<push_adr_example>] = i
  push_adr_example++
  mem[<push_adr_example>] = length

According to this embodiment, push is terminated and the register content may be deleted.

Other configurations are executed. After a period of time, the example configuration is restarted.

Before that, a configuration pop is started, and it reads the register contents out of the memory again.

proc pop
  i = mem[<push_adr_example>]
  push_adr_example++
  length = mem[<push_adr_example>]

After execution, pop is terminated and the register contents remain unchanged. The configuration “example” is restarted.
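The push and pop configurations of this example may be modeled by the following Python sketch. The dictionary `global_memory`, the base address table `PUSH_AREA`, and the choice of sorted register names as the "defined order" are assumptions for illustration; in this model both configurations start from the same base address, which is what allows pop to restore the values in the original order.

```python
global_memory = {}                 # stands in for the global memory
PUSH_AREA = {"example": 0x100}     # hypothetical base address per configuration

def push(config, registers):
    """Write register contents to the PUSH AREA of config in a defined order."""
    adr = PUSH_AREA[config]
    for name in sorted(registers):           # defined order: sorted names
        global_memory[adr] = registers[name]
        adr += 1                              # models the address generator

def pop(config, names):
    """Read the saved values back from the PUSH AREA in the same order."""
    adr = PUSH_AREA[config]
    registers = {}
    for name in sorted(names):
        registers[name] = global_memory[adr]
        adr += 1
    return registers

# "example" is interrupted with i = 7 and length = 20.
push("example", {"i": 7, "length": 20})
# ... other configurations use the registers here ...
restored = pop("example", ["i", "length"])
print(restored)  # prints {'i': 7, 'length': 20}
```

Giving each configuration its own entry in `PUSH_AREA` corresponds to providing different address spaces for different configurations.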

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a possible system structure.

FIG. 2 shows an example compilation sequence.

FIG. 3 shows the structure of an example VPU.

FIG. 4 shows an example CPU.

FIG. 5 shows an example abstract system definition.

FIG. 6 shows an example interface.

FIG. 7 shows data transfers between VPU and CPU.

FIG. 8 shows a memory area of the operating system.

DETAILED DESCRIPTION

FIG. 1 illustrates an example method in accordance with the present invention and shows a possible system structure, a PROCESSOR (0101) being connected to a VPU (0103) via a suitable interface (0102) for data exchange and status exchange.

A PROGRAM code (0110) is broken down (e.g., by a preprocessor for a compiler) into a portion (0111) suitable for the PROCESSOR and a VPU-suitable portion (0112), for example, according to the extraction methods described here.

Portion 0111 is translated by a standard compiler (0113) corresponding to the PROGRAM code, the additional code from a database (0114) for description and management of the interface (0102) between the PROCESSOR and a VPU being previously inserted. Sequential code executable on 0101 is generated (0116) and the corresponding programming (0117) of the interface (0102) is generated if necessary. The standard compiler may be of a type that is available as a conventional commercially available tool or as a portion of a development environment that is commercially available. The preprocessor and/or possibly the VPU compiler and/or possibly the debugger and additional tools may be integrated into an existing commercially available development environment, for example.

Portion 0112 is translated by a VPU compiler (0115), additional code for description and management of the interface (0102) being inserted from a database (0114). Configurations executable on 0103 are generated (0118) and, if necessary, the corresponding programming (0119) of the interface (0102) is also generated. It should be pointed out explicitly that in principle, compilers as described in DE 101 39 170.6 may also be used for 0115.

FIG. 2 shows a basic compilation sequence as an example. In the extraction unit (0202), a PROGRAM (0201) is broken down into VPU code (0203) and PROCESSOR code (0204) according to different methods. Different methods may be used in any combination for extraction, e.g., instructions in the original PROGRAM (0205) and/or subprogram calls (0206) and/or analytical methods (0207) and/or utilization of object-oriented class libraries (0206 a). The code extracted is translated, if necessary, and checked for its suitability for the particular target system (0208), if necessary. Feedback (0209) to the extraction is possible to obtain improvements due to modified allocation of the codes to a PROCESSOR or a VPU and/or a plurality of same.

Thereafter (0211) VPU code 0203 is expanded (0212) using the interface code from a database (0210) and/or PROCESSOR code 0204 is expanded (0213) using the interface code from 0210.

The resulting code is analyzed for its performance (0214) and, if necessary, feedback (0215) to the extraction is possible to obtain improvements due to modified allocation of the codes to the PROCESSOR or a VPU.

The resulting VPU code (0216) is forwarded for further translation to a downstream compiler suitable for the VPU. For further translation, the resulting PROCESSOR code (0217) is processed further in any downstream compiler suitable for the PROCESSOR.

It should be pointed out that individual steps may be omitted, depending on the method. Generally, however, at least largely complete code, which is directly translatable without intervention by the programmer, or at least without any significant intervention, is output to the particular downstream compiler systems.

It is thus proposed that a preprocessor means be provided with a code input for supplying code to be compiled, with code analyzing means, in particular code structure and/or data format and/or data stream recognition and/or evaluation units, and with a segmenting evaluation unit for evaluating a code segmentation performed in response to signals from the code analyzing unit and, if necessary, with an iteration means for repeating a code segmentation until stable and/or sufficiently acceptable values are achieved, and with at least two partial code outputs, a first partial code output outputting partial code for at least one conventional processor, and at least one additional partial code output outputting code intended for processing by means of reconfigurable logic units, in particular multidimensional units having cell structures, in particular register means which process coarse-grained data and/or logic cells (PAEs) having arithmetic units and the like plus allocated register units, if necessary, and/or a fine-grained control means and/or monitoring means, such as state machines, RDY/ACK trigger lines and communication lines, etc. Both partial code outputs may be located at one physical output as serial multiplex outputs.

The database for the interface codes (0210) is constructed independently of and prior to the compiler run. For example, the following sources for the database are possible: predefined by the supplier (0220), programmed by the user (0221) or generated automatically by a development system (0222).

FIG. 3 shows the structure of a particularly preferred VPU. Preferably hierarchical configuration managers (CTs) (0301) control and manage a system of reconfigurable elements (PACs) (0302). The CTs are assigned a local memory for the configurations (0303). The memory also has an interface (0304) to a global memory which makes the configuration data available. The configuration runs in a controllable manner via an interface (0305). An interface of the reconfigurable elements (0302) to sequence control and event management (0306) is present, as is an interface to the data exchange (0307).

FIG. 4 shows details of an exemplary CPU system, e.g., a DSP of the C6000 type (0401) by Texas Instruments. This shows the program memory (0402), data memory (0403), any peripheral device (0404) and EMIF (0405). A VPU is integrated (0408) as a coprocessor via a memory bus (0406) and a peripheral bus (0407). A DMA controller (EDMA) (0409) may perform any DMA transfers, e.g., between the memory (0403) and the VPU (0408) or the memory (0403) and the peripheral device (0404).

FIG. 5 shows a more abstract system definition. A CPU (0501) is assigned a memory (0502) to which it has reading access and/or writing access. A VPU (0503) is connected to the memory. The VPU is subdivided into a CT portion (0509) and the reconfigurable elements for data processing (0510).

To increase memory access throughput, the memory may have a plurality of independent access buses (multiport). In a particularly preferred embodiment, the memory is segmented into a plurality of independent segments (memory banks), each bank being independently accessible. All the segments are preferably located within a uniform address space. One segment is preferably available mainly for the CPU (0504) and another segment is mainly available for data processing by the VPU (0505) while yet another segment is mainly available for the configuration data of the VPU (0506).

Typically and preferably, a fully configured VPU will have its own address generators and/or DMAs to perform data transfers. Alternatively and/or additionally, it is possible for a DMA (0507) to be provided within the system (FIG. 5) for data transfers with the VPU.

The system includes IO (0508) which may be accessible by the CPU and VPU.

The CPU and VPU may each have dedicated memory areas and IO areas to which the other has no access.

A data record (0511) which may be in the memory area and/or in the IO area and/or partially in one of the two is used for communication between the CPU and the VPU, e.g., for exchanging basic parameters and control information. The data record may contain the following information, for example:

-   -   1. Basic address(es) of the CT memory area in 0506 for localizing the configurations.
    -   2. Basic address(es) of data transfers with 0505.
    -   3. IO address(es) of data transfers with 0508.
    -   4. Synchronization information, e.g., resetting, stopping, starting the VPU.
    -   5. Status information on the VPU, e.g., errors or states of data processing.
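The data record listed above could be modeled, purely as an illustration, by the following Python sketch. All field names, the `VpuCommand` values, the status strings, and the function `vpu_poll` are hypothetical; the specification does not prescribe a layout.

```python
from dataclasses import dataclass
from enum import Enum

class VpuCommand(Enum):
    RESET = 0
    STOP = 1
    START = 2

@dataclass
class VpuDataRecord:
    config_base: int                        # 1. base address of CT memory area (0506)
    data_base: int                          # 2. base address for data transfers (0505)
    io_base: int                            # 3. IO address for data transfers (0508)
    command: VpuCommand = VpuCommand.STOP   # 4. synchronization information
    status: str = "idle"                    # 5. status information on the VPU

def vpu_poll(record):
    """Simulated VPU side: react to the command found when polling the record."""
    if record.command is VpuCommand.START:
        record.status = "running"
    elif record.command is VpuCommand.STOP:
        record.status = "stopped"
    else:                                   # VpuCommand.RESET
        record.status = "idle"
    return record.status

record = VpuDataRecord(config_base=0x1000, data_base=0x2000, io_base=0x3000)
record.command = VpuCommand.START           # CPU requests a start
print(vpu_poll(record))                     # prints running
```

The CPU writes the command field and reads the status field; an interrupt could replace the polling step.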

The CPU and the VPU are synchronized by data polling and/or preferably by interrupt control (0512).

FIG. 6 shows one possible embodiment of the interface structure of a VPU for tying into a system similar to that shown in FIG. 5. To do so, a memory/DMA interface and/or an IO interface is assigned (0601) to the VPU for data transfer; another system interface (0602) is responsible for sequence control such as managing interrupts, starting and stopping the processing, exchange of error states, etc.

The memory/DMA interface and/or IO interface is connected to a memory bus and/or an IO bus.

The system interface is preferably connected to an IO bus, but alternatively or additionally, it may also be connected to a memory according to 0511.

The interfaces (0601, 0602) may be designed for adaptation of different working frequencies of the CPU and/or the VPU and/or the system; for example, the system and/or the CPU may currently operate at 500 MHz and the VPU at 200 MHz.

The interfaces may perform a translation of the bus protocols, e.g., the VPU-internal protocol may be converted to an external AMBA bus protocol. They thus trigger bus protocol translation means and/or are designed for bus protocol translation, in particular bus protocol translation between an internal VPU protocol and a known bus protocol. It is also possible to provide for conversion directly to CPU-internal bus protocols.

The memory/DMA interface and/or the IO interface supports memory access by the CT to an external memory, which is preferably performed directly (memory mapped). The data transfer of the CT(s) and/or PAC(s) may be buffered, e.g., via FIFO stages. External memories may be addressed directly; in addition, DMA-internal and/or external DMA transfers are also performed.

Data processing, e.g., the initialization, i.e., the start of configurations, is controlled via the system interface. In addition, status and/or error states are exchanged. Interrupts for the control and synchronization between the CTs and a CPU may be supported.

The system interface is capable of converting VPU-internal protocols to external (standard) protocols (e.g., AMBA).

A preferred method of code generation for the system described here isdescribed herein. This method describes a compiler which breaks downprogram code into code for a CPU and code for a VPU. The breakdown isperformed by different methods on different processors. In aparticularly preferred embodiment, the particular codes broken down areexpanded by adding the interface routines for communication between CPUand VPU. The expansion may be performed automatically by the compiler.

The following tables show examples of communication between a CPU and a VPU. The columns are assigned to the particular active function units: CPU, system DMA and DMA interface (EDMA) and/or memory interface (memory I/F), system interface (system I/F, 0602), CTs and the PAC. The individual cycles are entered into the cells in the order of their execution. K1 references a configuration 1 that is to be executed.

The first table shows as an example a sequence when using the system DMA (EDMA) for data transfer:

CPU                              | EDMA                     | System I/F                      | CTs          | PAC
Initiate K1                      |                          | Load K1                         |              |
                                 |                          | Start K1                        | Configure K1 |
Initiate loading of data by EDMA |                          |                                 | Start K1     | Wait for data
Initiate reading of data by EDMA | Data transfer read data  |                                 |              | Data processing
                                 | Data transfer write data | Signal the end of the operation |              |

It should be pointed out that synchronization between the EDMA and the VPU is performed automatically via interface 0401, i.e., DMA transfers take place only when the VPU is ready.
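This automatic synchronization might be illustrated by the following minimal sketch, in which a transfer is carried out only in those cycles in which the VPU signals readiness. The function run_dma and the per-cycle ready pattern are assumptions for illustration; the actual handshake of interface 0401 is not modeled.

```python
def run_dma(words, ready_pattern):
    """Transfer one word per cycle, but only in cycles where the VPU is ready.

    Returns the words received by the VPU, the words still pending,
    and a per-cycle log of "transfer" or "stall".
    """
    received, log = [], []
    pending = list(words)
    for cycle, ready in enumerate(ready_pattern):
        if not pending:
            break
        if ready:
            received.append(pending.pop(0))
            log.append((cycle, "transfer"))
        else:
            # The DMA side simply waits; no data is lost or reordered.
            log.append((cycle, "stall"))
    return received, pending, log
```

The CPU is not involved in the stalls; it only initiates the transfer, and the ready signal throttles the DMA.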

A second table shows a preferred optimized sequence as an example. The VPU itself has direct access to the configuration memory (0306). In addition, data transfers are executed by a DMA circuit within the VPU, which may be fixedly implemented, for example, and/or formed by the configuration of configurable parts of the PAC.

CPU         | EDMA                     | System I/F                      | CTs                    | PAC
Initiate K1 |                          | Start K1                        | Read the configuration | Configure K1
            | Data transfer read data  |                                 | Start K1               | Read data
            |                          |                                 |                        | Data processing
            | Data transfer write data | Signal the end of the operation |                        | Write data

The complexity for the CPU is minimal.

In summary, the present invention relates to methods that permit translation of a traditional high-level language such as Pascal, C, C++, Java, etc., onto a reconfigurable architecture. This method is designed so that only those portions of the program that are to be translated and are suitable for the reconfigurable target architecture are extracted. The remaining portions of the program are translated onto a conventional processor architecture.

For reasons of simplicity, FIG. 7 shows only the relevant components (in particular the CPU), although a significant number of other components and networks would typically be present.

A preferred implementation such as that in FIG. 7 may provide different data transfers between a CPU (0701) and a VPU (0702). The configurations to be executed on the VPU are selected by the instruction decoder (0705) of the CPU, which recognizes certain instructions intended for the VPU and triggers the CT (0706), so that it loads the corresponding configurations out of a memory (0707) assigned to the CT (which may be shared with the CPU in particular or may be the same as the working memory of the CPU) into the array of PAEs (PA, 0108).
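The selection of configurations by the instruction decoder might be sketched as follows. The mnemonic VPU_EXEC, the class ConfigurationTable, and the representation of instructions as (opcode, operand) pairs are hypothetical; they merely illustrate that VPU-directed instructions trigger the CT instead of being executed by the CPU.

```python
VPU_OPCODE = "VPU_EXEC"  # hypothetical mnemonic for an instruction intended for the VPU


class ConfigurationTable:
    """Hypothetical CT that loads configurations from its memory into the PA."""

    def __init__(self, config_memory):
        self.config_memory = config_memory  # maps configuration id -> configuration data
        self.loaded = []                    # configurations loaded into the array of PAEs

    def load(self, config_id):
        self.loaded.append(self.config_memory[config_id])


def decode(instructions, ct):
    """Return the instructions left for the CPU; VPU instructions trigger the CT."""
    cpu_stream = []
    for opcode, operand in instructions:
        if opcode == VPU_OPCODE:
            ct.load(operand)
        else:
            cpu_stream.append((opcode, operand))
    return cpu_stream
```

For example, decoding the stream ADD, VPU_EXEC K1, SUB leaves ADD and SUB for the CPU and causes the CT to load the configuration stored under K1.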

CPU registers (0703) are provided to obtain data in a register connection, to process the data and to write it back to a CPU register. A status register (0704) is provided for data synchronization. In addition, a cache is also provided, so that when data that has just been processed by the CPU is to be exchanged, it is still presumably in the cache (0709) of the CPU and/or will be processed immediately thereafter by the CPU.
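One conceivable way of using a status register for data synchronization over coupled registers is sketched below. The assumption that each status bit marks one register as holding a valid VPU result, and the names CoupledRegisters, vpu_write, and cpu_read, are made only for this sketch.

```python
class CoupledRegisters:
    """Hypothetical register coupling with a per-register valid bit."""

    def __init__(self, count):
        self.regs = [0] * count
        self.status = 0  # bit i set: register i holds a valid, unread VPU result

    def vpu_write(self, index, value):
        """VPU side: write a result back and mark the register as valid."""
        self.regs[index] = value
        self.status |= 1 << index

    def cpu_read(self, index):
        """CPU side: return the value and clear the bit, or None if not yet valid."""
        if not self.status & (1 << index):
            return None
        self.status &= ~(1 << index)
        return self.regs[index]
```

A CPU that receives None would wait or continue with other work until the corresponding status bit is set.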

The external bus is labeled as (0710) and through it, data is read out of a data source (e.g., memory, peripheral device) connected to it, for example, and/or is written to the external bus and the data sink connected to it (e.g., memory, peripheral device). This bus may in particular be the same as the external bus of the CPU (0712 & dashed line).

A protocol (0711) between cache and bus is implemented, ensuring the correct contents of the cache. An FPGA (0713) may be connected to the VPU to permit fine-grained data processing and/or to permit a flexible adaptable interface (0714) (e.g., various serial interfaces (V24, USB, etc.), various parallel interfaces, hard drive interfaces, Ethernet, telecommunications interfaces (a/b, TO, ISDN, DSL, etc.)) to additional modules and/or the external bus system (0712).

According to FIG. 8, the memory area of the operating system contains a table or an interlinked list (LINKLIST, 0801) which points to all VPUCALL tables (0802) in the order in which they are created.
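An interlinked list of this kind might be sketched as follows. The contents of a VPUCALL table are not specified in this passage, so the fields owner and entries are placeholders; the class names VpuCallTable and LinkList are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VpuCallTable:
    """Placeholder for a VPUCALL table; the fields are assumed for illustration."""
    owner: str
    entries: List[str] = field(default_factory=list)


@dataclass
class _Node:
    table: VpuCallTable
    next: Optional["_Node"] = None


class LinkList:
    """Singly linked list that keeps VPUCALL tables in creation order."""

    def __init__(self):
        self._head = None
        self._tail = None

    def register(self, table):
        """Append a newly created table at the tail of the list."""
        node = _Node(table)
        if self._tail is None:
            self._head = node
        else:
            self._tail.next = node
        self._tail = node

    def __iter__(self):
        node = self._head
        while node is not None:
            yield node.table
            node = node.next
```

Appending at the tail preserves the order of creation, so a traversal from the head visits the tables in the order in which they were created.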

1. A method for translating a program for a system including at least one first processor and a reconfigurable unit, the method comprising: determining from the program, code portions of the program suitable for the reconfigurable unit; and at least one of extracting and separating, remaining code of the program for processing by the first processor.
 2. The method as recited in claim 1, further comprising: appending interface code to the code portions extracted for the first processor to permit communication between the first processor and the reconfigurable unit according to the system.
 3. The method as recited in claim 1, further comprising: appending interface to the code portions extracted for the reconfigurable unit so that communication is enabled between the first processor and the reconfigurable unit according to the system.
 4. The method as recited in claim 1, wherein the determining step includes determining the code portions based on automated analyses.
 5. The method as recited in claim 1, wherein the program includes instructions defining the code portions to be extracted, and wherein the method further comprises automatically analyzing the instructions.
 6. The method as recited in claim 1, wherein the code portions to be extracted are determined based on calls of subprograms.
 7. The method as recited in claim 1, further comprising: providing an interface code which provides at least one of memory linkage, register linkage, and linkage via a network.
 8. The method as recited in claim 1, further comprising: analyzing at least one of the extracted code portions and results achievable with a given extraction; and restarting an extraction with new improved parameters based on the analysis.
 9. The method as recited in claim 1, further comprising: appending control code to the extracted code for at least one of management, control, and communication of the development system.
 10. The method as recited in claim 1, wherein the first processor has a conventional processor architecture, the architecture including at least one of a von-Neumann architecture, Harvard architecture, controller, CISC processor, RISC processor, VLIW processor, or DSP processor.
 11. The method as recited in claim 1, wherein the remaining code is extracted so that it is translatable via any ordinary unmodified compiler that is suitable for the first processor.
 12. A device for data processing, comprising: at least one conventional processor; at least one reconfigurable unit; and an arrangement configured to exchange data and status information between a conventional processor and a reconfigurable unit, the arrangement being configured so that the data and status information exchange is possible therebetween at least one of: i) during processing of one or more programs, ii) without having to interrupt data processing on the reconfigurable processor, and iii) without having to interrupt data processing on the conventional processor. 